-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change the location and filename of schema.pbtxt to .merlin/schema.json #249
Conversation
tests/unit/io/test_io.py
Outdated
assert os.path.join(path, "_metadata") in data_file_list | ||
metadata_file_list = glob.glob(os.path.join(path, ".merlin", "*")) | ||
assert os.path.join(path, ".merlin", "_file_list.txt") in metadata_file_list | ||
assert os.path.join(path, ".merlin", "_metadata.json") in metadata_file_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjzamora This PR looks good to me, but I wanted to run these files by you before we move them. Can these move to the .merlin
directory without breaking things that depend on them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karlhigley Thanks for catching this. I think these are metadata files for hugeCTR, and I was maybe thinking hugeCTR is also Merlin so... After double-checking some of hugeCTR code, I think moving those files might break hugeCTR. I reverted those changes in a4c7cef to be safe so that these are still in the same directory as the parquet files, and only schema.json
goes into .merlin
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you that in the longer term they probably should get consolidated into the Merlin metadata directory, but this does seem safer for now 🤝
Fixes NVIDIA-Merlin/Merlin#845.
This PR moves Merlin-specific metadata files, such as schema, to a sub-directory
.merlin
in an effort to standardize the location and file format of Merlin schema. The format of the schema file is also changed to JSON.Models Tests (CPU)
in the GitHub Actions CI is failing, but this is expected as Models hosts the dataset tests. PR for addressing this: NVIDIA-Merlin/models#1021